Investigate a Dataset (TMDb Movie Data)

The primary goal of this project is to work through the TMDb movie dataset and the general data analysis process using NumPy, pandas and Matplotlib. It contains four parts:

Table of Contents

Introduction

Dataset

Questions

  1. Which year has the highest number of movie releases?
  2. Which movie has the highest or lowest profit? Which are the top 10 movies by profit?
  3. Which movies have the highest and lowest budgets?
  4. Which movies made the highest and lowest revenue?
  5. Which movies have the shortest and longest runtimes?
  6. Which movies get the highest and lowest votes (ratings)?
  7. Which year has the highest profit rate?
  8. Which movie lengths are most liked by audiences, according to popularity?
  9. How does the average runtime of movies change from year to year?
  10. How do revenue and popularity vary with budget and runtime? And how does popularity depend on profit?
  11. Which month has the highest number of releases across all years? And which month made the highest average revenue?
  12. Which genre has the highest number of releases?
  13. Which genres are most popular from year to year?
  14. Who are the most frequent cast members?
  15. Which are the top 20 production companies by number of releases?
  16. What is the lifetime profit earned by each production company?
  17. Which are the top 20 directors by number of movies directed?
  18. What kinds of properties are associated with movies that have high revenues?

Data Wrangling

After observing the dataset and the questions posed for this analysis, we will keep only the relevant data and delete the unused data.

General Properties

Observation From The Dataset

  • The currency of the columns 'budget', 'revenue', 'budget_adj' and 'revenue_adj' is not given. For this dataset I will assume the values are in US dollars.
  • The dataset contains many movies where the budget or revenue has a value of '0'.

Data Cleaning (Removing The Unused Information From The Dataset)

Information That We Need To Delete Or Modify

  1. Remove duplicate rows from the dataset.
  2. Change the format of the release date to datetime.
  3. Remove the unused columns that are not needed in the analysis process.
  4. Remove the movies that have a zero value for budget or revenue.

1. Remove Duplicate Rows
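
A minimal sketch of this step, assuming the data is held in a pandas DataFrame. The toy rows and the column names `original_title` and `budget` are made up for illustration and stand in for the real table.

```python
import pandas as pd

# Toy stand-in for the movie table; these rows are invented for illustration.
df = pd.DataFrame({
    "original_title": ["Movie A", "Movie B", "Movie B", "Movie C"],
    "budget": [100, 200, 200, 300],
})

# Count rows that are exact copies of an earlier row, then drop them.
n_duplicates = int(df.duplicated().sum())
df = df.drop_duplicates().reset_index(drop=True)

print(n_duplicates, len(df))
```

Checking the duplicate count before dropping makes it easy to confirm that only the expected rows disappear.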

2. Changing Format Of Release Date Into Datetime Format
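
One way this conversion could look with `pd.to_datetime`. The column name `release_date` and the month/day/two-digit-year text format are assumptions about the raw data, and the rows are invented.

```python
import pandas as pd

# Toy rows; the "month/day/two-digit-year" text format is an assumption
# about how the raw release_date column looks.
df = pd.DataFrame({
    "original_title": ["Movie A", "Movie B"],
    "release_date": ["6/9/15", "12/25/09"],
})

# Convert the text column to pandas datetime values.
df["release_date"] = pd.to_datetime(df["release_date"], format="%m/%d/%y")

print(df.dtypes)
```

If the dates really do use two-digit years, it is worth checking afterwards that older films were not parsed into a future century.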

3. Remove the unused columns that are not needed in the analysis process

The dataset has 21 columns, and we can drop the ones that are not useful in the analysis.
The columns imdb_id, homepage, tagline, overview, budget_adj and revenue_adj are not required for my analysis, so I will drop them.
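
A small sketch of dropping the columns named above with `DataFrame.drop`. The single toy row and the remaining column names (`original_title`, `budget`, `revenue`) are placeholders, not values from the dataset.

```python
import pandas as pd

# One invented row that carries the columns named in the text plus a few kept ones.
df = pd.DataFrame({
    "imdb_id": ["tt0000000"],
    "homepage": ["http://example.com"],
    "tagline": ["A tagline"],
    "overview": ["An overview"],
    "budget_adj": [1.0],
    "revenue_adj": [2.0],
    "original_title": ["Movie A"],
    "budget": [100],
    "revenue": [500],
})

# Columns the text lists as not required for the analysis.
unused_columns = ["imdb_id", "homepage", "tagline", "overview", "budget_adj", "revenue_adj"]
df = df.drop(columns=unused_columns)

print(list(df.columns))
```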

4. Drop the rows which contain incorrect or inappropriate values.

As you can see, this database contains many movies where the budget or revenue has a value of '0', which means that the values of those variables were not recorded for those movies. Calculating the profits of these movies would lead to misleading results. This may be due to varying factors, such as a lack of information or movies that were never released. I have chosen to remove these rows during the data cleaning phase.
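
The filter described above might be sketched as follows. The rows are invented, and treating a zero in either column as "not recorded" is the assumption taken from the paragraph above.

```python
import pandas as pd

# Toy rows: only Movie A has both a budget and a revenue recorded.
df = pd.DataFrame({
    "original_title": ["Movie A", "Movie B", "Movie C", "Movie D"],
    "budget": [100, 0, 300, 0],
    "revenue": [500, 250, 0, 0],
})

# Keep only rows where both amounts are non-zero.
recorded = (df["budget"] != 0) & (df["revenue"] != 0)
df = df[recorded].reset_index(drop=True)

print(df)
```

An alternative with the same effect would be to replace the zeros with missing values and then drop the rows that contain them.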

Now that the columns, rows and format of the dataset are in good shape, it's time to investigate the data for the questions asked.

Exploratory Data Analysis


Research Question 1 : Which year has the highest release of movies?

From the plot and the output we can conclude that 2014 has the highest number of releases (700), followed by 2013 (659) and 2015 (629).
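
A count like this can be produced with `value_counts`. The sketch below uses invented years and assumes a `release_year` column, so its numbers do not reproduce the figures above.

```python
import pandas as pd

# Invented release years, one entry per movie.
df = pd.DataFrame({"release_year": [2013, 2014, 2014, 2015, 2014, 2013]})

# Number of movies per year, ordered by year so it can be plotted as a time series.
releases_per_year = df["release_year"].value_counts().sort_index()
top_year = releases_per_year.idxmax()

print(releases_per_year)
print("Year with most releases:", top_year)
```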

Research Question 2 : Which Movie Has The Highest Or Lowest Profit?

The first column shows the movie with the highest profit and the second column shows the movie with the biggest loss in this dataset.

As we can see, 'Avatar', directed by James Cameron, earned the highest profit of all, making over 2.5B in profit. The biggest loss in this dataset belongs to 'The Warrior's Way', directed by Sngmoo Lee, which lost more than 400M.
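
A two-column comparison like the one described can be sketched as below, assuming profit is defined as revenue minus budget. The rows and column names are placeholders.

```python
import pandas as pd

# Invented rows; profit is assumed to mean revenue minus budget.
df = pd.DataFrame({
    "original_title": ["Movie A", "Movie B", "Movie C"],
    "budget": [10, 50, 30],
    "revenue": [100, 20, 30],
})
df["profit"] = df["revenue"] - df["budget"]

# Index labels of the most and least profitable rows.
highest = df["profit"].idxmax()
lowest = df["profit"].idxmin()

# Side-by-side view: first column is the highest, second the lowest,
# with the original index numbers as column names.
summary = pd.concat([df.loc[highest], df.loc[lowest]], axis=1)

print(summary)
```

The same pattern applies to budget, revenue, runtime or vote average by swapping the column passed to `idxmax` and `idxmin`, and `df.nlargest(10, "profit")` would give a top-10 list.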


Research Question 3 : Movie with Highest And Lowest Budget?

Movie with the highest budget: The Warrior's Way
Movie with the lowest budget: Fear Clinic (taken from the table above the graph)
The graph shows the top 10 highest-budget movies.

Research Question 4 : Movie with Highest And Lowest Earned Revenue?

The first column shows the movie with the highest revenue and the second column shows the movie with the lowest revenue in this dataset. As we can see, 'Avatar', directed by James Cameron, made the highest revenue of all, at over 2.78B. The movie with the lowest revenue is 'Wild Card', directed by Simon West.

Research Question 5 : Movie with Longest And Shortest Runtime?

Again, the first column shows the movie with the longest runtime and the second the shortest, with the index numbers as column names.

I have never heard of a movie with such a long runtime: 900 min, which is approximately 15 hrs! So 'The Story of Film: An Odyssey' has the longest runtime. It was released as a multi-part documentary series, which is why it is so long.

The movie with the shortest runtime is 'Fresh Guacamole', with a runtime of just 2 min! I have never seen such a short movie in my lifetime.

Movie with the longest runtime: The Story of Film: An Odyssey (900 min)
Movie with the shortest runtime: Fresh Guacamole (2 min) (taken from the table above the graph)
The graph shows the top 10 longest movies.

Research Question 6 : Movie with Highest And Lowest Votes?

The first column contains the movie with the highest votes and the second column contains the movie with the lowest votes.
As we can see, 'The Story of Film: An Odyssey', directed by Mark Cousins, has the maximum rating (92%), and the movie with the lowest user rating is 'Transmorphers', directed by Leigh Scott, with 15%.

Movie with the highest votes: The Story of Film: An Odyssey (92%)
Movie with the lowest votes: Transmorphers (15%) (taken from the table above the graph)
The graph shows the top 10 highest-rated movies.

Research Question 7 : Which Year Has The Highest Profit Rate?

According to the plot, 2002-03 were the most profitable years, and profit was very low between 1960 and 1970.
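
A yearly profit series of this kind could be built with a `groupby`. This sketch assumes "profit per year" means the sum of each movie's profit and uses invented values.

```python
import pandas as pd

# Invented per-movie profits, tagged with their release year.
df = pd.DataFrame({
    "release_year": [2001, 2001, 2002, 2002],
    "profit": [10, 20, 50, -5],
})

# Total profit of all movies released in each year.
profit_per_year = df.groupby("release_year")["profit"].sum()
most_profitable_year = profit_per_year.idxmax()

print(profit_per_year)
```

The same pattern with `.mean()` on a runtime column would give the average runtime per year.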

Research Question 8 : Which length movies most liked by the audiences according to their popularity?

According to the plot, movies with a runtime in the range of 100-200 minutes are more popular than movies of other lengths, possibly because very long movies are tiring to watch.
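
One way to check a claim like this numerically is to bin runtimes and compare mean popularity per bin. The bin edges, labels and rows below are all illustrative assumptions rather than the method used for the plot.

```python
import pandas as pd

# Invented runtimes (minutes) and popularity scores.
df = pd.DataFrame({
    "runtime": [90, 110, 150, 240],
    "popularity": [0.5, 1.5, 2.5, 0.4],
})

# Group movies into runtime bands; edges are right-inclusive, e.g. (100, 200].
length_band = pd.cut(
    df["runtime"],
    bins=[0, 100, 200, float("inf")],
    labels=["up to 100", "100-200", "over 200"],
)

# Mean popularity within each band.
mean_popularity = df.groupby(length_band, observed=True)["popularity"].mean()
most_popular_band = mean_popularity.idxmax()

print(mean_popularity)
```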

Research Question 9: Average Runtime Of Movies From Year To Year?

According to the plot, the average movie duration is decreasing from year to year. A possible explanation is that audiences today are less willing to watch long movies. The average runtime of movies is now around 100 minutes.

Research Question 10: How Do Revenue And Popularity Vary With Budget And Runtime? And How Does Popularity Depend On Profit?

We can see in this graph that the number of voters increases each year.

These are estimated values and they may differ.

  1. Budget vs Revenue : Budget and revenue have a positive correlation (0.68). This means there is a good possibility that movies with higher investments result in better revenues.
  2. Profit vs Budget : Profit and budget have a positive correlation (0.53). This means there is a good possibility that movies with higher investments result in better profit.
  3. Release Year vs Vote Average : Release year and vote average have a weak negative correlation (-0.11). This means that movie ratings (vote average) barely depend on the release year.
  4. Popularity vs Profit : Popularity and profit have a positive correlation (0.61). This means that movies with high popularity tend to earn high profit.
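
Correlation coefficients like the ones listed above can be read off a correlation matrix. The sketch below assumes Pearson correlation (the pandas default) and uses invented numbers, so its output will not match the values quoted.

```python
import pandas as pd

# Invented values for three numeric columns.
df = pd.DataFrame({
    "budget": [10, 20, 30, 40, 50],
    "revenue": [15, 45, 40, 90, 95],
    "popularity": [0.4, 1.1, 0.9, 2.5, 2.0],
})
df["profit"] = df["revenue"] - df["budget"]

# Pairwise Pearson correlation between all numeric columns.
corr = df.corr()
budget_vs_revenue = corr.loc["budget", "revenue"]
popularity_vs_profit = corr.loc["popularity", "profit"]

print(corr.round(2))
```

A coefficient near 0, such as the -0.11 reported for release year and vote average, indicates little linear relationship, and none of these values establishes causation on its own.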

Research Question 11: Which Month Released Highest Number Of Movies In All Of The Years? And Which Month Made The Highest Average Revenue?

According to the plot, we can conclude that the number of releases is higher in September and October.

According to the plot, movies released in May or June made higher revenue than movies released in other months, although this could also be caused by outliers.
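
Both monthly views can be derived from the datetime release date. This is a sketch with invented rows that assumes `release_date` has already been converted to datetime.

```python
import pandas as pd

# Invented release dates and revenues.
df = pd.DataFrame({
    "release_date": pd.to_datetime(
        ["2015-05-01", "2014-09-12", "2013-09-20", "2012-06-15"]
    ),
    "revenue": [400, 100, 50, 300],
})

# Month number (1-12) of each release, pooled over all years.
df["release_month"] = df["release_date"].dt.month

releases_per_month = df["release_month"].value_counts().sort_index()
mean_revenue_per_month = df.groupby("release_month")["revenue"].mean()

busiest_month = releases_per_month.idxmax()
best_revenue_month = mean_revenue_per_month.idxmax()

print(busiest_month, best_revenue_month)
```

Because a mean is sensitive to a few very large values, comparing it with the median per month is one way to probe the outlier concern mentioned above.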

Research Question 12: Which Genre Has The Highest Release Of Movies?

According to the plot, the Drama genre (4761) has the highest number of releases, followed by Comedy (3793) and Thriller (2908).
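
Since a movie can belong to several genres stored in one '|'-separated string, counting needs a split first. A minimal sketch with invented rows, assuming a `genres` column:

```python
import pandas as pd

# Invented rows; each movie lists its genres separated by '|'.
df = pd.DataFrame({
    "genres": ["Drama|Thriller", "Comedy", "Drama|Comedy", "Drama"],
})

# Split each string into a list, put every genre on its own row, then count.
genre_counts = df["genres"].str.split("|").explode().value_counts()

print(genre_counts)
```

The same split-and-count approach would work for other '|'-separated columns such as cast, directors or production companies.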

Research Question 13: Which genres are most popular from year to year?

In the graph we can see that Drama is the most popular genre, and that it was more popular in 2013 than in the other recent years (2014, 2015). For each genre, the graph shows how its popularity changes from one year to the next. The graph tells us the most popular genres [Foreign, Documentary, Horror, ...., TV Movie, Romance].

For each genre, the graph also shows how its popularity stands in each decade.
For example, Horror:
in 1960 the popularity of Horror was 1.5 (standard units).

Research Question 14: Most Frequent Actor?

The most frequent actor is Robert De Niro, who appears in more than 70 movies (73). The graph shows the number of movies each actor appears in.

Research Question 15: Top 20 Production Companies With Higher Number Of Release?

This graph shows the number of movies produced by each production company.
For example:
Universal Pictures has produced more than 500 movies.

Research Question 16: Life Time Profit Earn By Each Production Company

The graph shows the total profit for each production company.
For example: Warner Bros has the largest profit of all, at 3.5e10.
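
A per-company total could be computed as sketched below. The company names and profits are invented, and crediting each co-producer with a movie's full profit is an assumption about how shared productions are counted.

```python
import pandas as pd

# Invented rows; co-productions list several companies separated by '|'.
df = pd.DataFrame({
    "production_companies": ["Studio A|Studio B", "Studio A", "Studio C"],
    "profit": [100, 50, 30],
})

# One row per (movie, company) pair; each company is credited with the full profit.
per_company = df.assign(
    company=df["production_companies"].str.split("|")
).explode("company")

# Sum of profits over every movie a company took part in.
lifetime_profit = (
    per_company.groupby("company")["profit"].sum().sort_values(ascending=False)
)

print(lifetime_profit)
```

With this counting rule the company totals add up to more than the overall profit, because shared movies are counted once per company.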

Research Question 17 : Top 20 Director Who Directs Maximum Movies?

The graph shows how many movies each director has made.
For example:
Woody Allen has directed more than 40 movies (approx. 48).

Research Question 18: What Kind Of Properties Are Associated With Movies With High Revenue?

The graphs plot revenue against budget, popularity, runtime and vote_avg.
We can see that the points for revenue vs popularity and revenue vs runtime are more spread out than for the other two variables.
For example, for revenue vs vote_avg:
most of the data points lie at revenues in the range (0:0.5), with vote_avg mostly in the range (4:8).

A brief description of the above plots:

Plot 1: Budget vs Revenue

The revenues do increase slightly at higher budget levels, but movies with high budgets seem scarce. There is a good possibility that movies with higher investments result in better revenues.

Plot 2: Popularity vs Revenue

The revenue seems to increase with popularity. We can say that if the popularity of a movie is high, its revenue is likely to be high as well.

Plot 3: Vote Average vs Revenue

The correlation between revenue and vote average is 0.2069, so vote average is not strongly related to revenue. I can't find a relationship here: the revenues don't seem to change with higher vote average.

Plot 4: Runtime vs Revenue

The correlation between revenue and runtime is 0.2378, so runtime is not strongly related to revenue.

Conclusions

  • Drama is the most popular genre, followed by Action, Comedy and Thriller.
  • Drama, Comedy, Thriller and Action are the four most-made genres.
  • The maximum number of movies was released in 2014.
  • 'Avatar', 'Star Wars' and 'Titanic' are the most profitable movies.
  • Short and medium-length movies are more popular than long-duration movies.
  • The average runtime of movies is decreasing year by year.
  • May, June, November and December are the best months to release a movie if you want to earn more profit.
  • Revenue is positively correlated with budget.
  • Warner Bros, Universal Pictures and Paramount Pictures earn more lifetime profit than other production companies.
  • Movies with higher budgets have shown a corresponding increase in revenue.

Limitations

  • This formula is not guaranteed to work, but it shows that a movie with similar characteristics has a high probability of making high profits. Releasing a movie with these characteristics also gives people high expectations of it. This is just one example of an influential factor that could lead to different results; there are many others that have to be taken into account.
  • During the data cleaning process, I split the data separated by '|' into lists for easy parsing during the exploration phase. This increases the time taken to calculate the results.